https://www.kaggle.com/code/alperahin111/insurance-prediction/notebook
**Libraries:**
import pandas as pd
import numpy as np
from scipy import stats
import statsmodels.api as sm
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
**Multiple Regression Problem:**
**Exploratory Data Analysis:**
dataset = pd.read_csv('expenses.csv')
dataset
| age | sex | bmi | children | smoker | region | charges | |
|---|---|---|---|---|---|---|---|
| 0 | 19 | female | 27.900 | 0 | yes | southwest | 16884.92400 |
| 1 | 18 | male | 33.770 | 1 | no | southeast | 1725.55230 |
| 2 | 28 | male | 33.000 | 3 | no | southeast | 4449.46200 |
| 3 | 33 | male | 22.705 | 0 | no | northwest | 21984.47061 |
| 4 | 32 | male | 28.880 | 0 | no | northwest | 3866.85520 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 1333 | 50 | male | 30.970 | 3 | no | northwest | 10600.54830 |
| 1334 | 18 | female | 31.920 | 0 | no | northeast | 2205.98080 |
| 1335 | 18 | female | 36.850 | 0 | no | southeast | 1629.83350 |
| 1336 | 21 | female | 25.800 | 0 | no | southwest | 2007.94500 |
| 1337 | 61 | female | 29.070 | 0 | yes | northwest | 29141.36030 |
1338 rows × 7 columns
shape_first = dataset.shape[0]
dataset.drop_duplicates(inplace = True)
print( shape_first - dataset.shape[0], ' --> removed from raw data.' )
dataset
1 --> removed from raw data.
| age | sex | bmi | children | smoker | region | charges | |
|---|---|---|---|---|---|---|---|
| 0 | 19 | female | 27.900 | 0 | yes | southwest | 16884.92400 |
| 1 | 18 | male | 33.770 | 1 | no | southeast | 1725.55230 |
| 2 | 28 | male | 33.000 | 3 | no | southeast | 4449.46200 |
| 3 | 33 | male | 22.705 | 0 | no | northwest | 21984.47061 |
| 4 | 32 | male | 28.880 | 0 | no | northwest | 3866.85520 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 1333 | 50 | male | 30.970 | 3 | no | northwest | 10600.54830 |
| 1334 | 18 | female | 31.920 | 0 | no | northeast | 2205.98080 |
| 1335 | 18 | female | 36.850 | 0 | no | southeast | 1629.83350 |
| 1336 | 21 | female | 25.800 | 0 | no | southwest | 2007.94500 |
| 1337 | 61 | female | 29.070 | 0 | yes | northwest | 29141.36030 |
1337 rows × 7 columns
b = sns.countplot(data=dataset, x='region')
b.axes.set_title("Distribution of Region",fontsize=20, fontweight = 'bold')
b.set_xlabel("Regions",fontsize=14,fontweight = 'bold')
b.set_ylabel("Sample Sizes",fontsize=14, fontweight = 'bold')
b.tick_params(labelsize=14)
plt.show()
b = sns.countplot(data=dataset, x='sex')
b.axes.set_title("Distribution of Gender",fontsize=20, fontweight = 'bold')
b.set_xlabel("Gender",fontsize=14,fontweight = 'bold')
b.set_ylabel("Sample Sizes",fontsize=14, fontweight = 'bold')
b.tick_params(labelsize=14)
plt.show()
b = sns.countplot(data=dataset, x='smoker')
b.axes.set_title("Distribution of Smoking Habit",fontsize=20, fontweight = 'bold')
b.set_xlabel("Smoking Habit",fontsize=14,fontweight = 'bold')
b.set_ylabel("Sample Sizes",fontsize=14, fontweight = 'bold')
b.tick_params(labelsize=14)
plt.show()
dataset.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 1337 entries, 0 to 1337 Data columns (total 7 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 age 1337 non-null int64 1 sex 1337 non-null object 2 bmi 1337 non-null float64 3 children 1337 non-null int64 4 smoker 1337 non-null object 5 region 1337 non-null object 6 charges 1337 non-null float64 dtypes: float64(2), int64(2), object(3) memory usage: 83.6+ KB
[ print(str(col+ ':'), dataset[col].unique()) for col in ['sex', 'smoker', 'region'] ]
sex: ['female' 'male'] smoker: ['yes' 'no'] region: ['southwest' 'southeast' 'northwest' 'northeast']
[None, None, None]
**Dummy Variables for categorical Values:**
# Sex
dataset['Sex_int'] = dataset['sex'].apply(lambda x: 0 if 'female' == x else 1)
'''
Female = 0
Male = 1
'''
# Smoking Habit
dataset['smoker_int'] = dataset['smoker'].apply(lambda x: 0 if 'no' == x else 1)
'''
No = 0
Yes = 1
'''
# Bölge
dataset_new = pd.get_dummies(dataset, columns = ['region'] )
dataset_new
| age | sex | bmi | children | smoker | charges | Sex_int | smoker_int | region_northeast | region_northwest | region_southeast | region_southwest | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 19 | female | 27.900 | 0 | yes | 16884.92400 | 0 | 1 | 0 | 0 | 0 | 1 |
| 1 | 18 | male | 33.770 | 1 | no | 1725.55230 | 1 | 0 | 0 | 0 | 1 | 0 |
| 2 | 28 | male | 33.000 | 3 | no | 4449.46200 | 1 | 0 | 0 | 0 | 1 | 0 |
| 3 | 33 | male | 22.705 | 0 | no | 21984.47061 | 1 | 0 | 0 | 1 | 0 | 0 |
| 4 | 32 | male | 28.880 | 0 | no | 3866.85520 | 1 | 0 | 0 | 1 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1333 | 50 | male | 30.970 | 3 | no | 10600.54830 | 1 | 0 | 0 | 1 | 0 | 0 |
| 1334 | 18 | female | 31.920 | 0 | no | 2205.98080 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1335 | 18 | female | 36.850 | 0 | no | 1629.83350 | 0 | 0 | 0 | 0 | 1 | 0 |
| 1336 | 21 | female | 25.800 | 0 | no | 2007.94500 | 0 | 0 | 0 | 0 | 0 | 1 |
| 1337 | 61 | female | 29.070 | 0 | yes | 29141.36030 | 0 | 1 | 0 | 1 | 0 | 0 |
1337 rows × 12 columns
'''New columns belonging to 4 different regions were created. If any three of these columns are zero, the remaining column
can be also known; consequently it was removed from the dataset. Otherwise, multicollinearity problem arises and the
parameters cannot be predicted.
Example : region_northeast can be determined according to related columns.
region_northwest region_southeast region_southwest
0 0 0
'''
dataset_new.drop(['region_northeast', 'sex', 'smoker'], axis=1, inplace = True)
dataset_new
| age | bmi | children | charges | Sex_int | smoker_int | region_northwest | region_southeast | region_southwest | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 19 | 27.900 | 0 | 16884.92400 | 0 | 1 | 0 | 0 | 1 |
| 1 | 18 | 33.770 | 1 | 1725.55230 | 1 | 0 | 0 | 1 | 0 |
| 2 | 28 | 33.000 | 3 | 4449.46200 | 1 | 0 | 0 | 1 | 0 |
| 3 | 33 | 22.705 | 0 | 21984.47061 | 1 | 0 | 1 | 0 | 0 |
| 4 | 32 | 28.880 | 0 | 3866.85520 | 1 | 0 | 1 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1333 | 50 | 30.970 | 3 | 10600.54830 | 1 | 0 | 1 | 0 | 0 |
| 1334 | 18 | 31.920 | 0 | 2205.98080 | 0 | 0 | 0 | 0 | 0 |
| 1335 | 18 | 36.850 | 0 | 1629.83350 | 0 | 0 | 0 | 1 | 0 |
| 1336 | 21 | 25.800 | 0 | 2007.94500 | 0 | 0 | 0 | 0 | 1 |
| 1337 | 61 | 29.070 | 0 | 29141.36030 | 0 | 1 | 1 | 0 | 0 |
1337 rows × 9 columns
def sns_graph(df, parameter_x, parameter_y, hue):
b = sns.scatterplot(data = df, x = parameter_x, y = parameter_y, hue = hue)
plt.title( str(parameter_x) + ' - ' + str(parameter_y), fontweight = 'bold', fontsize=15 )
b.set_xlabel(str(parameter_x),fontsize=14,fontweight = 'bold')
b.set_ylabel(str(parameter_y),fontsize=14, fontweight = 'bold')
b.tick_params(labelsize=14)
sns_graph(dataset, 'age', 'charges', None)
# When the Age Insurance Payment graph is drawn, it is clearly seen that the insurance payment is concentrated in three different regions.
# It can be interpreted from the graph that insurance payments increase as you get older.
sns_graph(dataset, 'age', 'charges', 'region')
# It is seen that there is not much difference in insurance payments according to regions.
sns_graph(dataset, 'age', 'charges', 'smoker')
# It is observed that the general population of non-smokers makes insurance payments of 15,000 or less.
fig = px.scatter_3d(dataset, x="age", y= 'bmi', z="charges",color = 'smoker',
color_continuous_scale=px.colors.sequential.RdBu, title=" Age- Charges- BMI")
fig.update_layout(showlegend=False, paper_bgcolor = 'rgba(0, 0, 0, 0)', plot_bgcolor = 'rgba(0, 0, 0, 0)' )
fig.update_xaxes(linewidth = 2, linecolor ='black')
fig.update_yaxes(linewidth = 2, linecolor ='black')
fig.show()
dataset.groupby(by = ['region', 'smoker','sex']).count()
| age | bmi | children | charges | Sex_int | smoker_int | |||
|---|---|---|---|---|---|---|---|---|
| region | smoker | sex | ||||||
| northeast | no | female | 132 | 132 | 132 | 132 | 132 | 132 |
| male | 125 | 125 | 125 | 125 | 125 | 125 | ||
| yes | female | 29 | 29 | 29 | 29 | 29 | 29 | |
| male | 38 | 38 | 38 | 38 | 38 | 38 | ||
| northwest | no | female | 135 | 135 | 135 | 135 | 135 | 135 |
| male | 131 | 131 | 131 | 131 | 131 | 131 | ||
| yes | female | 29 | 29 | 29 | 29 | 29 | 29 | |
| male | 29 | 29 | 29 | 29 | 29 | 29 | ||
| southeast | no | female | 139 | 139 | 139 | 139 | 139 | 139 |
| male | 134 | 134 | 134 | 134 | 134 | 134 | ||
| yes | female | 36 | 36 | 36 | 36 | 36 | 36 | |
| male | 55 | 55 | 55 | 55 | 55 | 55 | ||
| southwest | no | female | 141 | 141 | 141 | 141 | 141 | 141 |
| male | 126 | 126 | 126 | 126 | 126 | 126 | ||
| yes | female | 21 | 21 | 21 | 21 | 21 | 21 | |
| male | 37 | 37 | 37 | 37 | 37 | 37 |
**Outlier Detection & Remove Outliers**
def box_plot (df, parameter):
plt.rcParams["figure.figsize"] = [5.50, 5.50]
sns.boxplot(data = pd.DataFrame(df[parameter]))
plt.title(str(parameter)+ ' Box Plot', fontsize=18, fontweight ='bold')
plt.tick_params(axis='both', which='major', labelsize=14)
plt.show()
[box_plot(dataset_new, parameter) for parameter in ['age', 'bmi', 'children','charges']]
[None, None, None, None]
def outlier_detection(df, column_name):
df_out = df.copy()
Q1 = np.percentile(df[str(column_name)], 25,
interpolation = 'midpoint')
Q3 = np.percentile(df[str(column_name)], 75,
interpolation = 'midpoint')
IQR = Q3 - Q1
# Upper bound
upper = np.where(df[str(column_name)] >= (Q3+1.5*IQR))
# Lower bound
lower = np.where(df[str(column_name)] <= (Q1-1.5*IQR))
''' Removing the Outliers '''
df_out.drop(upper[0], inplace = True)
df_out.drop(lower[0], inplace = True)
print(df.shape[0] - df_out .shape[0] , " rows are removed from data")
return df_out
dataset_wo_outliers = outlier_detection(dataset_new,'bmi')
9 rows are removed from data
sns_graph(dataset_wo_outliers, 'age', 'charges', None)
dataset_wo_outliers = dataset_wo_outliers[dataset_wo_outliers['charges'] < 50000]
dataset_wo_outliers.reset_index(drop=True,inplace=True)
sns_graph(dataset_wo_outliers, 'age', 'charges', None)
**Correlation of Independent Variables:**
'''
In order to estimate the health insurance payment, by looking at the bilateral relations of the observed variables with each other, the variables
dependencies can be checked.
With values close to -1, there is a strong inverse ratio between the two variables.
With values close to 1, there is a strong direct ratio between the two variables.
With values close to 0, there is no correlation.
'''
independent_variables = ['age','bmi',
'children','Sex_int',
'smoker_int','region_northwest',
'region_southeast','region_southwest']
dependent_variable = ['charges']
plt.rcParams["figure.figsize"] = [15.50, 12.50]
plt.rcParams["figure.autolayout"] = True
corr= dataset_wo_outliers[independent_variables].corr()
mask = np.triu(np.ones_like(corr, dtype=bool))
sns.heatmap(corr,cmap="RdYlGn", annot = True, annot_kws={"size": 14, "va": "center_baseline", "color": "black"},mask = mask,linewidth=3, vmin=-1, vmax=1)
plt.title('Correlation Matrix', fontweight = 'bold', fontsize = 16)
Text(0.5, 1.0, 'Correlation Matrix')
When we look at the correlation map, it can be said that there is no linear correlation between the independent variables.
**Conformity to Normal Distribution**
https://ravenfo.com/2021/07/11/normal-dagilim-python-normallik-testi/
**Shapiro-Wilk Test**
! It is not recommended to be used in cases where the sample size is larger than 5,000, as it may create bias.
def Shapiro_Wilk_Test(df, parameter):
print(str(parameter) + " test whether a data set is only normally distributed" )
w, p = stats.shapiro(df[parameter])
alpha = 0.05
print('W Statictic:', w, '\np value:',p)
if p > alpha:
print('The H0 hypothesis cannot be rejected. There is not enough evidence to say that the data is not normally distributed.\n')
else:
print('H0 hypothesis is rejected. There is sufficient evidence to say that the data is not normally distributed.\n')
[Shapiro_Wilk_Test(dataset_wo_outliers, parameter) for parameter in ['age', 'bmi']]
age test whether a data set is only normally distributed W Statictic: 0.9447934627532959 p value: 8.067505035029967e-22 H0 hypothesis is rejected. There is sufficient evidence to say that the data is not normally distributed. bmi test whether a data set is only normally distributed W Statictic: 0.9945359230041504 p value: 9.369832696393132e-05 H0 hypothesis is rejected. There is sufficient evidence to say that the data is not normally distributed.
[None, None]
Although the W statistic are close to 1, the value is not statistically significant since the p-value is smaller than the alpha value at the 95% confidence level..
**Kolmogorov-Smirnov Test**
def Kolmogorov_Smirnov_Test (df, parameter):
print(str(parameter) + " test whether a data set is only normally distributed." )
ks, p = stats.kstest(df[parameter], "norm")
alpha = 0.05
print('ks Statistic:', ks, '\np value:',p)
if p > alpha:
print('The H0 hypothesis cannot be rejected. There is not enough evidence to say that the data is not normally distributed.\n')
else:
print('H0 hypothesis is rejected. There is sufficient evidence to say that the data is not normally distributed.\n')
Standardization:
**$z$ = $ x - \sigma \over \mu $**
x : Value to be converted
$\sigma$ : The average of the variable to be converted
$\mu$ : The standard deviation of the variable to be converted
def standardization(df, parameter):
df[parameter] = ( df[parameter]- df[parameter].mean() ) / df[parameter].std()
dataset_wo_outliers_stand = dataset_wo_outliers.copy()
[standardization(dataset_wo_outliers_stand, parameter) for parameter in ['age', 'bmi', 'children']]
dataset_wo_outliers_stand
| age | bmi | children | charges | Sex_int | smoker_int | region_northwest | region_southeast | region_southwest | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | -1.434282 | -0.447437 | -0.912024 | 16884.92400 | 0 | 1 | 0 | 0 | 1 |
| 1 | -1.505455 | 0.524076 | -0.083936 | 1725.55230 | 1 | 0 | 0 | 1 | 0 |
| 2 | -0.793725 | 0.396637 | 1.572239 | 4449.46200 | 1 | 0 | 0 | 1 | 0 |
| 3 | -0.437860 | -1.307234 | -0.912024 | 21984.47061 | 1 | 0 | 1 | 0 | 0 |
| 4 | -0.509033 | -0.285242 | -0.912024 | 3866.85520 | 1 | 0 | 1 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1317 | 0.772082 | 0.060663 | 1.572239 | 10600.54830 | 1 | 0 | 1 | 0 | 0 |
| 1318 | -1.505455 | 0.217892 | -0.912024 | 2205.98080 | 0 | 0 | 0 | 0 | 0 |
| 1319 | -1.505455 | 1.033831 | -0.912024 | 1629.83350 | 0 | 0 | 0 | 1 | 0 |
| 1320 | -1.291936 | -0.794997 | -0.912024 | 2007.94500 | 0 | 0 | 0 | 0 | 1 |
| 1321 | 1.554986 | -0.253796 | -0.912024 | 29141.36030 | 0 | 1 | 1 | 0 | 0 |
1322 rows × 9 columns
[Kolmogorov_Smirnov_Test(dataset_wo_outliers_stand, parameter) for parameter in ['age', 'bmi']]
age test whether a data set is only normally distributed. ks Statistic: 0.07906046056549959 p value: 1.2358016493060771e-07 H0 hypothesis is rejected. There is sufficient evidence to say that the data is not normally distributed. bmi test whether a data set is only normally distributed. ks Statistic: 0.02605340424804603 p value: 0.32522439680069026 The H0 hypothesis cannot be rejected. There is not enough evidence to say that the data is not normally distributed.
[None, None]
There is insufficient evidence at the 95% confidence interval to state that the Body Mass Index is not normally distributed according to the Kolmogorov-Smirnov test.
def dist_plot(df, parameter):
sns.displot(df[parameter])
plt.xlabel(str(parameter), fontsize=14);
plt.ylabel('Count', fontsize=14);
plt.title(str(parameter)+ ' Distribution', fontsize=18, fontweight ='bold')
plt.tick_params(axis='both', which='major', labelsize=14)
plt.show()
print('Skewness:', stats.skew(df[parameter]))
[dist_plot(dataset_wo_outliers, parameter) for parameter in ['age', 'bmi', 'children','charges']]
Skewness: 0.06360208845683453
Skewness: 0.25427518233179686
Skewness: 0.93006284892155
Skewness: 1.44292532828262
[None, None, None, None]
def QQ_plot(df, parameter):
plt.rcParams["figure.figsize"] = [4.50, 4.50]
plt.rcParams["figure.autolayout"] = True
stats.probplot(df[str(parameter)], dist = "norm", plot = plt)
plt.title(str(parameter)+ " Q-Q Graph", fontsize=18, fontweight ='bold')
plt.show()
According to the distribution tables given above:
Variables other than Body Mass Index must be transformed for normal distribution.
[QQ_plot(dataset_wo_outliers, parameter) for parameter in ['age', 'bmi']]
[None, None]
**Normal Transform**
It is a statistical technique known to have a corrective effect on skewed data. The function determines the type of transforming the non-normal distribution to the normal distribution according to the lambda (λ) parameter. For Box-Cox, the lambda (λ) parameter has the range -5 <λ <5. The values of the lamp and the conversion type opposite are as follows.
def norm_transform(method,df, parameter) :
# Orjinal değişken
x_original = df[parameter].sort_values()
if method == 'Box-Cox':
x_sr, lam = stats.boxcox(x_original)
elif method == 'Reciprocal Transformation':
x_sr = 1 / x_original
elif method == 'Square-Root':
x_sr = x_original ** (1/2)
df[parameter] = x_sr
# original distribution
mean, std = stats.norm.fit(x_original, loc=0)
pdf_norm = stats.norm.pdf(x_original, mean, std)
# Box-Cox distribution
mean, std = stats.norm.fit(x_sr, loc=0)
pdf_norm_sk = stats.norm.pdf(x_sr, mean, std)
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 4))
# Original values and their distributions
ax1.hist(x_original, bins= 20 , density=True
,color = "grey", ec= "skyblue")
ax1.plot(x_original, pdf_norm
,color = "red", linewidth=4, linestyle=':')
ax1.set_ylabel('Probability')
ax1.set_title('Original Shape')
# Box-Cox transformation
ax2.hist(x_sr, bins= 20, density=True
,color = "green", ec="skyblue")
ax2.plot(x_sr, pdf_norm_sk
,color = "red", linewidth=4, linestyle=':')
print(str(parameter))
ax2.set_title(str(method) + ' Transformation')
plt.show()
df[parameter] = x_sr
dataset_new_box_cox = dataset_wo_outliers.copy()
[norm_transform('Box-Cox', dataset_new_box_cox, parameter) for parameter in ['age']]
age
[None]
dataset_new_box_cox
| age | bmi | children | charges | Sex_int | smoker_int | region_northwest | region_southeast | region_southwest | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.931435 | 27.900 | 0 | 16884.92400 | 0 | 1 | 0 | 0 | 1 |
| 1 | 7.931435 | 33.770 | 1 | 1725.55230 | 1 | 0 | 0 | 1 | 0 |
| 2 | 7.931435 | 33.000 | 3 | 4449.46200 | 1 | 0 | 0 | 1 | 0 |
| 3 | 7.931435 | 22.705 | 0 | 21984.47061 | 1 | 0 | 1 | 0 | 0 |
| 4 | 7.931435 | 28.880 | 0 | 3866.85520 | 1 | 0 | 1 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1317 | 19.129377 | 30.970 | 3 | 10600.54830 | 1 | 0 | 1 | 0 | 0 |
| 1318 | 19.129377 | 31.920 | 0 | 2205.98080 | 0 | 0 | 0 | 0 | 0 |
| 1319 | 19.129377 | 36.850 | 0 | 1629.83350 | 0 | 0 | 0 | 1 | 0 |
| 1320 | 19.129377 | 25.800 | 0 | 2007.94500 | 0 | 0 | 0 | 0 | 1 |
| 1321 | 19.129377 | 29.070 | 0 | 29141.36030 | 0 | 1 | 1 | 0 | 0 |
1322 rows × 9 columns
[Shapiro_Wilk_Test(dataset_new_box_cox, parameter) for parameter in ['age']]
age test whether a data set is only normally distributed W Statictic: 0.9440970420837402 p value: 5.959806021185804e-22 H0 hypothesis is rejected. There is sufficient evidence to say that the data is not normally distributed.
[None]
'''
The parameters are standardized for Kolmogorov-Smirnov Test .
'''
dataset_new_box_cox_stand = dataset_new_box_cox.copy()
[standardization(dataset_new_box_cox_stand, parameter) for parameter in ['age', 'bmi']]
dataset_new_box_cox_stand
| age | bmi | children | charges | Sex_int | smoker_int | region_northwest | region_southeast | region_southwest | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | -1.618489 | -0.447437 | 0 | 16884.92400 | 0 | 1 | 0 | 0 | 1 |
| 1 | -1.618489 | 0.524076 | 1 | 1725.55230 | 1 | 0 | 0 | 1 | 0 |
| 2 | -1.618489 | 0.396637 | 3 | 4449.46200 | 1 | 0 | 0 | 1 | 0 |
| 3 | -1.618489 | -1.307234 | 0 | 21984.47061 | 1 | 0 | 1 | 0 | 0 |
| 4 | -1.618489 | -0.285242 | 0 | 3866.85520 | 1 | 0 | 1 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1317 | 1.638929 | 0.060663 | 3 | 10600.54830 | 1 | 0 | 1 | 0 | 0 |
| 1318 | 1.638929 | 0.217892 | 0 | 2205.98080 | 0 | 0 | 0 | 0 | 0 |
| 1319 | 1.638929 | 1.033831 | 0 | 1629.83350 | 0 | 0 | 0 | 1 | 0 |
| 1320 | 1.638929 | -0.794997 | 0 | 2007.94500 | 0 | 0 | 0 | 0 | 1 |
| 1321 | 1.638929 | -0.253796 | 0 | 29141.36030 | 0 | 1 | 1 | 0 | 0 |
1322 rows × 9 columns
[Kolmogorov_Smirnov_Test(dataset_new_box_cox_stand, parameter) for parameter in ['age']]
age test whether a data set is only normally distributed. ks Statistic: 0.07619713993449828 p value: 4.022893110721751e-07 H0 hypothesis is rejected. There is sufficient evidence to say that the data is not normally distributed.
[None]
dataset_new_box_cox_1= dataset_wo_outliers.copy()
[norm_transform('Box-Cox', dataset_new_box_cox_1, parameter) for parameter in ['charges']]
charges
[None]
dataset_new_box_cox_stand_1 = dataset_new_box_cox_1.copy()
[standardization(dataset_new_box_cox_stand_1, parameter) for parameter in ['charges']]
[Kolmogorov_Smirnov_Test(dataset_new_box_cox_stand_1, parameter) for parameter in ['charges']]
charges test whether a data set is only normally distributed. ks Statistic: 0.03920828435055035 p value: 0.03342985390288211 H0 hypothesis is rejected. There is sufficient evidence to say that the data is not normally distributed.
[None]
**Multiple Regression**
AIC means Akaike’s Information Criteria. --> present the danger that it would outfit
BIC means Bayesian Information Criteria. --> present the danger that it would underfit
http://www.differencebetween.net/miscellaneous/difference-between-aic-and-bic/#ixzz7p8lyEOdo
dataset_wo_outliers.exog = sm.add_constant(dataset_wo_outliers[independent_variables], prepend=False)
# Fit and summarize OLS model
mod = sm.OLS(dataset_wo_outliers[dependent_variable], dataset_wo_outliers.exog )
res = mod.fit()
print(res.summary())
OLS Regression Results
==============================================================================
Dep. Variable: charges R-squared: 0.753
Model: OLS Adj. R-squared: 0.752
Method: Least Squares F-statistic: 501.3
Date: Tue, 20 Jun 2023 Prob (F-statistic): 0.00
Time: 20:03:33 Log-Likelihood: -13337.
No. Observations: 1322 AIC: 2.669e+04
Df Residuals: 1313 BIC: 2.674e+04
Df Model: 8
Covariance Type: nonrobust
====================================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------------
age 256.0187 11.532 22.201 0.000 233.396 278.642
bmi 327.4143 27.966 11.708 0.000 272.552 382.277
children 511.5555 133.402 3.835 0.000 249.852 773.259
Sex_int -97.8954 322.872 -0.303 0.762 -731.298 535.507
smoker_int 2.322e+04 402.789 57.636 0.000 2.24e+04 2.4e+04
region_northwest -453.6206 461.579 -0.983 0.326 -1359.133 451.892
region_southeast -1013.7175 463.753 -2.186 0.029 -1923.495 -103.940
region_southwest -1023.1013 463.766 -2.206 0.028 -1932.905 -113.298
const -1.154e+04 964.105 -11.971 0.000 -1.34e+04 -9649.958
==============================================================================
Omnibus: 261.911 Durbin-Watson: 2.096
Prob(Omnibus): 0.000 Jarque-Bera (JB): 527.858
Skew: 1.149 Prob(JB): 2.38e-115
Kurtosis: 5.075 Cond. No. 313.
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
C:\Users\msahin42\AppData\Local\Temp\ipykernel_10984\2032741484.py:1: UserWarning: Pandas doesn't allow columns to be created via a new attribute name - see https://pandas.pydata.org/pandas-docs/stable/indexing.html#attribute-access
**Formula**
$Charge$ = -11540 + 256.0187 Age + 327.4143 BMI + 511.5555 Children - 97.8954 Sex + 23220 smoker - 453.6206 Northwest - 1013.7175 Southeast - 1023.1013 Soutwest
plt.rcParams["figure.figsize"] = [12.50, 7.50]
plt.scatter (dataset_wo_outliers.age , dataset_wo_outliers['charges'] , color = "red")
plt.scatter( dataset_wo_outliers.age, res.predict(dataset_wo_outliers.exog), color = "blue")
plt.xlabel ("age", fontweight = 'bold', fontsize=13)
plt.ylabel ( "Charges", fontweight = 'bold', fontsize=13)
plt.title ( "Age - Charges" , fontweight = 'bold', fontsize=16)
plt.show()
**Finding Beta Coefficient**
dataset_new_stand = dataset_wo_outliers.copy()
[standardization(dataset_new_stand, parameter) for parameter in ['age', 'bmi', 'children']]
dataset_new_stand.exog = sm.add_constant(dataset_new_stand[independent_variables], prepend=False)
# Fit and summarize OLS model
mod = sm.OLS(dataset_new_stand[dependent_variable], dataset_new_stand.exog )
result = mod.fit()
print(result.summary())
OLS Regression Results
==============================================================================
Dep. Variable: charges R-squared: 0.753
Model: OLS Adj. R-squared: 0.752
Method: Least Squares F-statistic: 501.3
Date: Tue, 20 Jun 2023 Prob (F-statistic): 0.00
Time: 20:03:33 Log-Likelihood: -13337.
No. Observations: 1322 AIC: 2.669e+04
Df Residuals: 1313 BIC: 2.674e+04
Df Model: 8
Covariance Type: nonrobust
====================================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------------
age 3597.1296 162.026 22.201 0.000 3279.272 3914.987
bmi 1978.2767 168.973 11.708 0.000 1646.789 2309.764
children 617.7554 161.096 3.835 0.000 301.721 933.789
Sex_int -97.8954 322.872 -0.303 0.762 -731.298 535.507
smoker_int 2.322e+04 402.789 57.636 0.000 2.24e+04 2.4e+04
region_northwest -453.6206 461.579 -0.983 0.326 -1359.133 451.892
region_southeast -1013.7175 463.753 -2.186 0.029 -1923.495 -103.940
region_southwest -1023.1013 463.766 -2.206 0.028 -1932.905 -113.298
const 9065.7607 374.747 24.192 0.000 8330.592 9800.930
==============================================================================
Omnibus: 261.911 Durbin-Watson: 2.096
Prob(Omnibus): 0.000 Jarque-Bera (JB): 527.858
Skew: 1.149 Prob(JB): 2.38e-115
Kurtosis: 5.075 Cond. No. 5.63
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
C:\Users\msahin42\AppData\Local\Temp\ipykernel_10984\1824816278.py:4: UserWarning: Pandas doesn't allow columns to be created via a new attribute name - see https://pandas.pydata.org/pandas-docs/stable/indexing.html#attribute-access
**Conclusion and Recommendations**
Taking into account the age, number of children, place of residence, body mass index, gender, and smoking status of a new customer, insurance payments can be predicted by using the multiple regression model, which is explainable at a rate of 75.3%. The fact that an individual smokes increases the insurance payment, and it can be concluded that there is a positive relationship between age and insurance payment. Increasing body mass index also increases insurance payments. However, as mentioned in the last paragraph of the findings and evaluation section, in reality insurance payments are grouped in three different areas. The established model was somewhat inadequate in separating these three groups. During the analysis of the data, it was stated that little data were collected about smokers and there was an imbalance. For future studies, increasing the number of smokers and repeating the analyzes can make a positive contribution to the model results. In addition, with the increase in the number of data, the normality will increase and the sample will better represent the main population.